Skip to content

Conversation

@sjyango
Copy link
Contributor

@sjyango sjyango commented Jan 4, 2026

This PR references ByteDance's implementation. Co-Author: @WencongLiu

Description

This PR introduces native support for Python User-Defined Functions (UDF), User-Defined Aggregate Functions (UDAF), and User-Defined Table Functions (UDTF) in Doris, enabling users to extend SQL capabilities with custom Python logic for complex data processing scenarios.

Key Features

🚀 Three Function Types

  • UDF: Scalar functions with row-by-row or vectorized execution (3-10x performance gain with Pandas/Arrow mode)
  • UDAF: Snowflake-style stateful aggregation with distributed merge support
  • UDTF: Table-valued functions that generate multiple output rows from a single input row

🔧 Production-Grade Architecture

  • High-Performance Communication: Arrow Flight RPC over Unix sockets with zero-copy columnar data transfer
  • Multi-Version Support: Flexible environment management via Conda or venv, allowing different UDFs to use different Python versions (e.g., 3.9, 3.10, 3.12)

🎯 Deep Integration

  • Seamless integration with Doris vectorized execution engine
  • Native support for Doris data types (including complex types like ARRAY, MAP, STRUCT)
  • Automatic conversion between Doris types and Python/Arrow types

Architecture Highlights

┌──────────────────────────────────────────────────────┐
│  Doris BE (C++)                                      │
│  ┌────────────────────────────────────────────────┐  │
│  │  PythonServerManager (Process Pool)            │  │
│  │  ├─ Health Check Thread (60s interval)         │  │
│  │  ├─ Load Balancing (min ref count)             │  │
│  │  └─ Auto Recovery (dead process detection)     │  │
│  └────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────┐  │
│  │  PythonClient (Arrow Flight RPC)               │  │
│  │  ├─ UDF: Scalar/Vectorized evaluation          │  │
│  │  ├─ UDAF: Stateful aggregation with merge      │  │
│  │  └─ UDTF: ListArray batch processing           │  │
│  └────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────┘
                      ↕ Unix Socket
┌──────────────────────────────────────────────────────┐
│  Python Process (python_server.py)                   │
│  ┌────────────────────────────────────────────────┐  │
│  │  FlightServer (Arrow Flight bidirectional)     │  │
│  │  ├─ AdaptivePythonUDF (auto mode selection)    │  │
│  │  ├─ UDAFStateManager (Snowflake interface)     │  │
│  │  └─ UDFLoader (inline/module code execution)   │  │
│  └────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────┘

Configuration

Add to be.conf:

# Enable Python UDF support
enable_python_udf_support = true

# Choose environment management mode (conda or venv)
python_env_mode = conda

# For Conda mode
python_conda_root_path = /path/to/miniconda3

# For venv mode
python_env_mode = venv
python_venv_root_path = /doris/python_envs
python_venv_interpreter_paths = /opt/python3.9/bin/python3.9:/opt/python3.12/bin/python3.12

# Process pool size (0 = use CPU core count)
max_python_process_num = 0

Technical Highlights

1. Environment Management

  • Multi-version support: Each UDF can specify its own Python version
  • Two modes: Conda (full environment isolation) or venv (lightweight)
  • Automatic discovery: Scans available Python environments at BE startup

2. Process Pool Management

  • Shared pool: One pool per Python version, shared across all threads
  • Load balancing: Distributes requests to processes with minimum load
  • Health monitoring: Background thread checks process health every 60 seconds
  • Auto recovery: Automatically recreates dead processes

3. Communication Protocol

  • Arrow Flight RPC: High-performance, language-agnostic RPC framework
  • Unix Socket: Local IPC for minimal latency and enhanced security
  • Bidirectional streaming: Efficient batch data transfer

4. Execution Modes

  • Scalar mode: Process one value at a time (simple functions)
  • Vectorized mode: Process entire columns with NumPy/Pandas (10-100x faster)
  • Adaptive selection: Automatically chooses mode based on function signature

5. UDAF State Management (Snowflake Style)

  • 5 lifecycle methods: __init__, accumulate, merge, finish, aggregate_state
  • Distributed aggregation: Serialization/deserialization for shuffle operations
  • Efficient state handling: Place-based mapping avoids redundant transfers

Limitations

  1. Performance: Python UDFs are slower than native C++ built-in functions. Best suited for complex logic that's difficult to implement in SQL.
  2. Type support: Special Doris types like HLL and Bitmap are not yet supported.
  3. Concurrency: Parallelism is limited by max_python_process_num setting.

Related Documentation

  • Python UDF User Guide
  • Python UDAF User Guide
  • Python UDTF User Guide
  • Python Environment Configuration Guide

This PR enables users to leverage the rich Python ecosystem (NumPy, Pandas, scikit-learn, etc.) directly within Doris SQL queries, significantly expanding the platform's data processing capabilities.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Copy link
Contributor

Thearas commented Jan 4, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sjyango sjyango force-pushed the python_udf branch 2 times, most recently from fba1d68 to 58258db Compare January 6, 2026 05:51
@sjyango
Copy link
Contributor Author

sjyango commented Jan 6, 2026

run buildall

@hello-stephen
Copy link
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.57% (1780/2237)
Line Coverage 64.91% (31561/48624)
Region Coverage 65.46% (15701/23986)
Branch Coverage 56.06% (8345/14886)

@hello-stephen
Copy link
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/437) 🎉
Increment coverage report
Complete coverage report

@sjyango sjyango force-pushed the python_udf branch 3 times, most recently from 511c8ad to fd27a83 Compare January 8, 2026 14:41
@sjyango
Copy link
Contributor Author

sjyango commented Jan 8, 2026

run buildall

@doris-robot
Copy link

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.57% (1784/2242)
Line Coverage 64.75% (31728/48997)
Region Coverage 65.43% (15782/24120)
Branch Coverage 56.02% (8385/14968)

@doris-robot
Copy link

TPC-H: Total hot run time: 32292 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit fd27a83e92c9b225b457c2281ed688f086663e88, data reload: false

------ Round 1 ----------------------------------
q1	17628	4186	4081	4081
q2	2033	371	248	248
q3	10511	1284	726	726
q4	10371	901	333	333
q5	9689	2063	1990	1990
q6	246	171	143	143
q7	964	813	674	674
q8	9288	1444	1181	1181
q9	5106	4668	4585	4585
q10	6880	1807	1401	1401
q11	544	312	289	289
q12	726	745	605	605
q13	17810	3928	3108	3108
q14	291	301	289	289
q15	598	520	513	513
q16	706	709	652	652
q17	676	841	495	495
q18	6987	6523	7475	6523
q19	1177	1036	641	641
q20	451	430	249	249
q21	3174	2560	2589	2560
q22	1127	1091	1006	1006
Total cold run time: 106983 ms
Total hot run time: 32292 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4429	4273	4354	4273
q2	333	433	317	317
q3	2303	2830	2444	2444
q4	1526	1948	1633	1633
q5	4439	4468	4609	4468
q6	220	170	131	131
q7	1974	1904	1755	1755
q8	2662	2472	2425	2425
q9	7176	7303	7048	7048
q10	2489	2732	2283	2283
q11	527	464	446	446
q12	673	751	620	620
q13	3399	3867	3107	3107
q14	285	284	260	260
q15	536	484	489	484
q16	623	663	626	626
q17	1135	1378	1376	1376
q18	7453	7325	7200	7200
q19	858	833	836	833
q20	1919	1963	1820	1820
q21	4609	4301	4150	4150
q22	1071	1013	971	971
Total cold run time: 50639 ms
Total hot run time: 48670 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 172159 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit fd27a83e92c9b225b457c2281ed688f086663e88, data reload: false

query5	4445	591	439	439
query6	339	237	219	219
query7	4206	458	262	262
query8	325	252	256	252
query9	8761	2669	2689	2669
query10	520	380	333	333
query11	15167	15124	14834	14834
query12	178	113	117	113
query13	1266	493	383	383
query14	6157	3038	2785	2785
query14_1	2666	2676	2699	2676
query15	199	196	175	175
query16	989	485	458	458
query17	1112	699	592	592
query18	2489	423	339	339
query19	226	222	192	192
query20	118	113	113	113
query21	209	135	125	125
query22	3795	3990	3738	3738
query23	15909	15576	15295	15295
query23_1	15324	15373	15450	15373
query24	7430	1535	1160	1160
query24_1	1164	1148	1167	1148
query25	529	439	392	392
query26	1257	263	152	152
query27	2777	458	286	286
query28	4549	2143	2133	2133
query29	760	547	434	434
query30	316	246	222	222
query31	827	616	563	563
query32	77	79	71	71
query33	540	339	290	290
query34	928	872	529	529
query35	713	754	670	670
query36	865	883	744	744
query37	125	93	74	74
query38	2739	2677	2710	2677
query39	781	757	734	734
query39_1	729	699	698	698
query40	220	134	116	116
query41	64	61	62	61
query42	104	102	105	102
query43	467	432	425	425
query44	1319	719	718	718
query45	182	185	177	177
query46	851	969	581	581
query47	1369	1458	1353	1353
query48	312	316	239	239
query49	606	417	323	323
query50	641	289	204	204
query51	3769	3774	3744	3744
query52	110	110	101	101
query53	292	327	272	272
query54	281	248	247	247
query55	77	75	71	71
query56	303	293	300	293
query57	1049	971	953	953
query58	303	255	259	255
query59	1963	2143	2103	2103
query60	323	320	297	297
query61	161	167	158	158
query62	395	363	307	307
query63	299	274	271	271
query64	4956	1306	1019	1019
query65	3754	3736	3709	3709
query66	1435	412	302	302
query67	14863	15061	14397	14397
query68	2820	1029	735	735
query69	448	347	313	313
query70	1023	902	925	902
query71	320	289	274	274
query72	6233	3659	3668	3659
query73	599	730	303	303
query74	8741	8763	8650	8650
query75	2773	2828	2457	2457
query76	2870	1061	656	656
query77	359	383	287	287
query78	9785	10071	9170	9170
query79	1065	900	595	595
query80	1425	569	474	474
query81	552	265	227	227
query82	989	144	111	111
query83	362	260	239	239
query84	255	117	101	101
query85	1209	538	470	470
query86	407	303	290	290
query87	2885	2841	2794	2794
query88	3268	2233	2229	2229
query89	396	343	335	335
query90	1911	154	145	145
query91	172	164	147	147
query92	66	68	64	64
query93	1166	901	524	524
query94	628	323	282	282
query95	562	371	309	309
query96	587	463	210	210
query97	2314	2363	2304	2304
query98	213	202	195	195
query99	583	592	538	538
Total cold run time: 248003 ms
Total hot run time: 172159 ms

@sjyango
Copy link
Contributor Author

sjyango commented Jan 11, 2026

run buildall

@hello-stephen
Copy link
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.57% (1784/2242)
Line Coverage 64.76% (31730/48997)
Region Coverage 65.44% (15784/24120)
Branch Coverage 55.99% (8380/14968)

@hello-stephen
Copy link
Contributor

FE UT Coverage Report

Increment line coverage 1.37% (6/437) 🎉
Increment coverage report
Complete coverage report

@doris-robot
Copy link

TPC-H: Total hot run time: 32627 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b4a679e26b9d31e300fe5d4b1a38088ef6448de0, data reload: false

------ Round 1 ----------------------------------
q1	17623	4215	4110	4110
q2	2043	362	243	243
q3	10527	1345	750	750
q4	10340	920	331	331
q5	9720	2171	1970	1970
q6	235	176	140	140
q7	985	831	667	667
q8	9277	1500	1285	1285
q9	5100	4681	4613	4613
q10	6853	1825	1414	1414
q11	533	312	306	306
q12	733	720	580	580
q13	17815	3929	3082	3082
q14	307	295	282	282
q15	592	522	506	506
q16	717	684	647	647
q17	677	862	540	540
q18	6772	6664	7455	6664
q19	1827	1021	638	638
q20	404	405	245	245
q21	3274	2624	2583	2583
q22	1078	1125	1031	1031
Total cold run time: 107432 ms
Total hot run time: 32627 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4321	4365	4405	4365
q2	335	406	322	322
q3	2279	2842	2373	2373
q4	1426	2040	1461	1461
q5	4546	4332	4463	4332
q6	212	166	133	133
q7	1985	1942	1776	1776
q8	2501	2449	2389	2389
q9	7205	6999	7465	6999
q10	2424	2711	2108	2108
q11	519	466	431	431
q12	672	705	575	575
q13	3359	3902	3137	3137
q14	258	283	260	260
q15	533	503	488	488
q16	616	651	595	595
q17	1118	1310	1330	1310
q18	7421	7276	7106	7106
q19	872	823	840	823
q20	1915	1978	1804	1804
q21	4604	4287	4127	4127
q22	1045	1005	970	970
Total cold run time: 50166 ms
Total hot run time: 47884 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 172878 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b4a679e26b9d31e300fe5d4b1a38088ef6448de0, data reload: false

query5	4525	617	446	446
query6	345	223	222	222
query7	4211	465	271	271
query8	340	265	261	261
query9	8773	2644	2646	2644
query10	541	395	327	327
query11	15344	15252	14875	14875
query12	173	122	119	119
query13	1265	479	377	377
query14	6389	3036	2780	2780
query14_1	2691	2671	2700	2671
query15	209	198	175	175
query16	982	501	474	474
query17	1114	683	581	581
query18	2547	454	349	349
query19	238	223	198	198
query20	127	121	118	118
query21	217	142	126	126
query22	4090	4049	3931	3931
query23	16042	15534	15382	15382
query23_1	15539	15333	15587	15333
query24	7385	1592	1165	1165
query24_1	1238	1199	1216	1199
query25	549	481	417	417
query26	1246	276	163	163
query27	2764	477	299	299
query28	4497	2136	2120	2120
query29	822	586	490	490
query30	322	244	221	221
query31	798	621	556	556
query32	78	72	71	71
query33	590	345	286	286
query34	894	879	530	530
query35	744	740	693	693
query36	855	886	787	787
query37	128	92	77	77
query38	2747	2756	2669	2669
query39	778	745	728	728
query39_1	727	725	706	706
query40	216	130	112	112
query41	68	63	62	62
query42	106	101	103	101
query43	463	469	421	421
query44	1352	717	721	717
query45	190	181	172	172
query46	859	970	588	588
query47	1410	1471	1349	1349
query48	311	323	239	239
query49	618	411	331	331
query50	647	276	198	198
query51	3748	3796	3859	3796
query52	108	121	97	97
query53	290	349	271	271
query54	288	258	255	255
query55	79	75	72	72
query56	331	299	291	291
query57	1027	982	891	891
query58	265	251	254	251
query59	2171	2156	2023	2023
query60	322	314	297	297
query61	159	157	155	155
query62	398	348	343	343
query63	296	271	271	271
query64	4926	1296	973	973
query65	3833	3726	3736	3726
query66	1411	430	304	304
query67	15494	15465	14893	14893
query68	5920	1009	723	723
query69	511	371	313	313
query70	1052	963	925	925
query71	371	305	296	296
query72	5994	3395	3447	3395
query73	773	730	300	300
query74	8753	8757	8536	8536
query75	2848	2837	2507	2507
query76	3441	1088	632	632
query77	520	395	279	279
query78	9729	9756	9188	9188
query79	1473	895	581	581
query80	651	589	473	473
query81	530	265	234	234
query82	206	151	108	108
query83	258	258	237	237
query84	263	122	98	98
query85	894	518	442	442
query86	399	329	291	291
query87	2845	2845	2750	2750
query88	3148	2208	2206	2206
query89	396	355	322	322
query90	2208	150	150	150
query91	174	167	137	137
query92	81	68	65	65
query93	1487	897	531	531
query94	575	316	284	284
query95	569	379	309	309
query96	580	451	205	205
query97	2340	2383	2302	2302
query98	241	197	192	192
query99	586	599	510	510
Total cold run time: 253232 ms
Total hot run time: 172878 ms

@doris-robot
Copy link

BE UT Coverage Report

Increment line coverage 3.15% (60/1905) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.79% (18986/35962)
Line Coverage 38.94% (175937/451847)
Region Coverage 33.51% (136383/406954)
Branch Coverage 34.52% (58894/170594)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants